The aim for this analysis was to investigate the molecular heterogenity in PDX (Patient-Derived-Xenografts) models, which are human tumor grafts transplanted in murine models in order to study their molecular evolution.
The project particularly focuses of the molecular signature of the ribosomal proteins (RP). Even though ribosomes are fundamental components for protein synthesis as they are housekeeping genes, recent studies hypothesize that theyr expression end chromatin accessibility can vary significantly among patients and tumor types also.
Through scRNA-seq (levels of expression) and scATAC-seq (to assess chromatin accessibility), the aim was to verify whether ribosomal genes alone could be sufficient to distinguish different PDX samples, revealing whether is true or not that they could be used to differentiate patients.
The dataset used for the analysis comes from GEO (NCBI) with GSE307240 code: all-together they are composed of five different PDX samples, and for each can be found a scRNA-seq (.h5) file and a scATAC-seq (.tsv.gz) file. In order to be able to compute the analysis inside a Docker container all the files were first read and then subsampled, extrapolating the data for only 500 cells for each patients: this allowed to analysis to have a balanced dataset and representative for 2500 cells
In this section are presented the results of the integration of the five PDX samples, focusing exclusively on RP genes.
The initial phase of the analysis involved managing raw scRNA-seq datasets from five distinct osteosarcoma Patient-Derived Xenograft (PDX) models. Due to the high density of the original data and the need to circumvent hardware limitations related to available RAM, a strategic subsampling approach was implemented.
The workflow consisted of the following key steps: (1) System paths were established to load counts from .h5 (Hierarchical Data Format) files, the standard output for 10x Genomics pipelines; (2) to ensure an equitable representation of each PDX model while drastically reducing the computational footprint, 500 cells were randomly selected from each patient. This allowed the analysis to focus on the ribosomal protein (RP) signature without exhausting system resources; (3) individual Seurat objects were merged into a single integrated dataset. To maximize efficiency, the data were stored as sparse matrices, a data structure that only records non-zero values, essential for scRNA-seq.
p1_path <- "../data/RNA/h5/GSM9219737_PDX_OS107_filtered_feature_bc_matrix.h5"
p2_path <- "../data/RNA/h5/GSM9219739_PDX_OS526_filtered_feature_bc_matrix.h5"
p3_path <- "../data/RNA/h5/GSM9219741_PDX_OS774_filtered_feature_bc_matrix.h5"
p4_path <- "../data/RNA/h5/GSM9219743_PDX_OS833_filtered_feature_bc_matrix.h5"
p5_path <- "../data/RNA/h5/GSM9219745_PDX_384_atac_filtered_feature_bc_matrix.h5"
p1_data <- Read10X_h5(p1_path)
p2_data <- Read10X_h5(p2_path)
p3_data <- Read10X_h5(p3_path)
p4_data <- Read10X_h5(p4_path)
p5_data <- Read10X_h5(p5_path)
p1_obj <- prepare_pdx_rna(p1_path, "P1", n_cells = 500)
p2_obj <- prepare_pdx_rna(p2_path, "P2", n_cells = 500)
p3_obj <- prepare_pdx_rna(p3_path, "P3", n_cells = 500)
p4_obj <- prepare_pdx_rna(p4_path, "P4", n_cells = 500)
p5_obj <- prepare_pdx_rna(p5_path, "P5", n_cells = 500)
path_rna <- "../data/RNA/"
saveRDS(p1_obj, paste0(path_rna, "p1_OS107_500cells.rds"))
saveRDS(p2_obj, paste0(path_rna, "p2_OS526_500cells.rds"))
saveRDS(p3_obj, paste0(path_rna, "p3_OS774_500cells.rds"))
saveRDS(p4_obj, paste0(path_rna, "p4_OS833_500cells.rds"))
saveRDS(p5_obj, paste0(path_rna, "p5_384_500cells.rds"))
merged_5pazienti_RNA <- merge(p1_OS107, y = c(p2_OS526, p3_OS774, p4_OS833, p5_384),
add.cell.ids = c("P1", "P2", "P3", "P4", "P5"),
project = "PDX_Integrated")
saveRDS(merged_5pazienti_RNA, "../data/RNA/merged_5pazienti_RNA.rds")
esporta_matrice_10x(merged_5pazienti_RNA)
Once the integrated sparse matrix was generated, the analysis was narrowed down to specifically target the expression profiles of human Ribosomal Protein (RP) genes. This targeted approach allows for the identification of patient-specific translational signatures within the PDX models.
To perform filtering and clustering the human genome annotation file (Ensembl release 115, GRCh38) has been used to identify and isolate specific genomic features. The GTF was edited to retain only RP genes and associated pseudogenes, specifically targeting those with the “RP” prefix in their nomenclature.
Using the custom-filtered annotation, the expression matrix was subsetted to include only the selected ribosomal features.
The resulting object underwent a standard scRNA-seq pipeline:
An Elbow Plot (Fig. 1a) was generated to visualize the variance explained by each Principal Component (PC). This allowed for an informed selection of the optimal number of dimensions for downstream analysis.
After selecting the most significant PCs, a UMAP projection (Fig. 1b) was created. The visualization demonstrates a clear patient-specific signature, where cells cluster primarily by their PDX model of origin based solely on the expression of ribosomal protein genes.
risultato_pdx <- readRDS("../data/RNA/merged_5pazienti_RNA.rds")
path_gtf <- "../data/RNA/Homo_sapiens.GRCh38.115.gtf.gz"
risultato_pdx <- rimuovi_geni_topo(risultato_pdx)
solo_rp <- filtra_ribosomiali(risultato_pdx, path_gtf)
risultato_pdx <- analizza_rna_ribo(solo_rp)
stile_titolo <- theme(plot.title = element_text(size = 10, face = "bold", hjust = 0.5))
p1 <- Seurat::ElbowPlot(risultato_pdx, ndims = 20) +
ggtitle("1a. PCA Elbow Plot (Selection of Dimensions)") + stile_titolo
p2 <- Seurat::DimPlot(risultato_pdx, reduction = "umap", group.by = "orig.ident") +
ggtitle("1b. Patient Specific Signature (RP Genes)") + stile_titolo
(p1 | p2)
To evaluate whether the result obtained previously was actually due to differences in RPs, a sweep of multiple clustering settings has been performed, analyzing neighbors, dimensionality (the number of PCs) and resolution (clustering based on cells granularity).
For dimensionality control three graphs have been produced: one with 2 dimensions, one with 10 dimensions and one with 50 dimensions.
The 2PC dimensionality reduction is not sufficient to correctly separate biological profiles; while the 50PC plot in figure 3b, when allowing 50 dimentions, causes technical noise that fragments the clusters. The optimal dimensional reduction was obtained with a dimensionality of 10PC.
For the resolution sweep the of 0.1, 0.2 and 1.2 resolutions were tested: the 0.1 one allows the identification of macro-differences between patients but is not enough to distinguish all five PDXs, while a 1.2 resolution cause over-clustering which is not supported by real biological differences. The 0.2 resolution represent the perfect equilibrium that allows for the discrimination of all five patient-profiles without fragmenting the clusters.
stile_titolo <- theme(plot.title = element_text(size = 10, face = "bold", hjust = 0.5))
# SWEEP DELLA RISOLUZIONE
res_base <- FindNeighbors(risultato_pdx, dims = 1:10, verbose = FALSE)
p_res_low <- DimPlot(FindClusters(res_base, resolution = 0.1, verbose = FALSE), group.by = "seurat_clusters") +
ggtitle("2a. Resolution: 0.1\n(under-clustering)") + stile_titolo
p_res_mid <- DimPlot(FindClusters(res_base, resolution = 0.2, verbose = FALSE), group.by = "seurat_clusters") +
ggtitle("2b. Resolution: 0.2\n(optimal)") + stile_titolo
p_res_high <- DimPlot(FindClusters(res_base, resolution = 1.2, verbose = FALSE), group.by = "seurat_clusters") +
ggtitle("2c. Resolution: 1.2\n(over-clustering)") + stile_titolo
# SWEEP DELLE DIMENSIONI (PC)
p_2pc <- DimPlot(RunUMAP(risultato_pdx, dims = 1:2, verbose = FALSE), group.by = "orig.ident") +
ggtitle("2d. Dimensionality: 2 PC") + stile_titolo
p_10pc <- DimPlot(RunUMAP(risultato_pdx, dims = 1:10, verbose = FALSE), group.by = "orig.ident") +
ggtitle("2e. Dimensionality: 10 PC") + stile_titolo
p_20pc <- DimPlot(RunUMAP(risultato_pdx, dims = 1:20, verbose = FALSE), group.by = "orig.ident") +
ggtitle("2f. Dimensionality: 20 PC") + stile_titolo
p_2pc <- p_2pc + theme(legend.position = "bottom")
p_10pc <- p_10pc + theme(legend.position = "none")
p_20pc <- p_20pc + theme(legend.position = "none")
(p_res_low | p_res_mid | p_res_high) / (p_2pc | p_10pc | p_20pc)
This parameter sweep shows that the optimal configuration for RP gene analysis has a PC of 10 and resolution of 0.1. This setup allows to balance biological differences avoiding technical noise cause by an excessive number of dimensions or due to an artificial fragmentation of the clusters.
The chromatin accessibility workflow utilized the Signac framework. Before going to the scATAC-seq data, it was necessary to create a GenomicRanges (GRanges) object: this preparative passage allows for the transformation of textual coordinates in computational objects, and specifically in this analysis they were needed to define the specific genomic coordinates of interest for the analysis of scATAC-seq. This was achieved again by using the already filtered human GTF (Ensembl release 115) file to retain only ribosomal protein genes and pseudogenes.
Technically speaking, GRanges objects are useful for their ability to manage genomic complexity because they not only memorize a sequence, but they also integrate essential metadata such as the chromosome of origin and genetic annotations.
Without GRanges, operations like genomic intervals intersections and calculating the distances between peaks will be prone to errors. GRanges solve this issue by allowing very fast overlapping and query, granting spatial coherence of the data, making the integration of different layers, such as accessibility loci of the chromatin and the correspondent RNA expression, easier.
ribo_granges <- crea_granges_ribo(solo_rp)
print(ribo_granges)
## GRanges object with 141 ranges and 22 metadata columns:
## seqnames ranges strand | source type
## <Rle> <IRanges> <Rle> | <factor> <factor>
## RPL22 1 6185020-6209389 - | ensembl_havana gene
## RPL11 1 23691685-23696835 + | ensembl_havana gene
## RPS6KA1 1 26529756-26575030 + | ensembl_havana gene
## RPA2 1 27891524-27914746 - | ensembl_havana gene
## RPS8 1 44775251-44778779 + | ensembl_havana gene
## ... ... ... ... . ... ...
## RPL36A X 101390824-101396155 + | ensembl_havana gene
## RPL39 X 119786497-119791662 - | ensembl_havana gene
## RPL10 X 154389955-154409168 + | ensembl_havana gene
## RPS4Y1 Y 2841602-2932000 + | ensembl_havana gene
## RPS4Y2 Y 20756108-20781032 + | ensembl_havana gene
## score phase gene_id gene_version gene_name
## <numeric> <integer> <character> <character> <character>
## RPL22 NA <NA> ENSG00000116251 12 RPL22
## RPL11 NA <NA> ENSG00000142676 15 RPL11
## RPS6KA1 NA <NA> ENSG00000117676 15 RPS6KA1
## RPA2 NA <NA> ENSG00000117748 11 RPA2
## RPS8 NA <NA> ENSG00000142937 13 RPS8
## ... ... ... ... ... ...
## RPL36A NA <NA> ENSG00000241343 11 RPL36A
## RPL39 NA <NA> ENSG00000198918 9 RPL39
## RPL10 NA <NA> ENSG00000147403 19 RPL10
## RPS4Y1 NA <NA> ENSG00000129824 16 RPS4Y1
## RPS4Y2 NA <NA> ENSG00000280969 2 RPS4Y2
## gene_source gene_biotype transcript_id transcript_version
## <character> <character> <character> <character>
## RPL22 ensembl_havana protein_coding <NA> <NA>
## RPL11 ensembl_havana protein_coding <NA> <NA>
## RPS6KA1 ensembl_havana protein_coding <NA> <NA>
## RPA2 ensembl_havana protein_coding <NA> <NA>
## RPS8 ensembl_havana protein_coding <NA> <NA>
## ... ... ... ... ...
## RPL36A ensembl_havana protein_coding <NA> <NA>
## RPL39 ensembl_havana protein_coding <NA> <NA>
## RPL10 ensembl_havana protein_coding <NA> <NA>
## RPS4Y1 ensembl_havana protein_coding <NA> <NA>
## RPS4Y2 ensembl_havana protein_coding <NA> <NA>
## transcript_name transcript_source transcript_biotype tag
## <character> <character> <character> <character>
## RPL22 <NA> <NA> <NA> <NA>
## RPL11 <NA> <NA> <NA> <NA>
## RPS6KA1 <NA> <NA> <NA> <NA>
## RPA2 <NA> <NA> <NA> <NA>
## RPS8 <NA> <NA> <NA> <NA>
## ... ... ... ... ...
## RPL36A <NA> <NA> <NA> <NA>
## RPL39 <NA> <NA> <NA> <NA>
## RPL10 <NA> <NA> <NA> <NA>
## RPS4Y1 <NA> <NA> <NA> <NA>
## RPS4Y2 <NA> <NA> <NA> <NA>
## transcript_support_level exon_number exon_id exon_version
## <character> <character> <character> <character>
## RPL22 <NA> <NA> <NA> <NA>
## RPL11 <NA> <NA> <NA> <NA>
## RPS6KA1 <NA> <NA> <NA> <NA>
## RPA2 <NA> <NA> <NA> <NA>
## RPS8 <NA> <NA> <NA> <NA>
## ... ... ... ... ...
## RPL36A <NA> <NA> <NA> <NA>
## RPL39 <NA> <NA> <NA> <NA>
## RPL10 <NA> <NA> <NA> <NA>
## RPS4Y1 <NA> <NA> <NA> <NA>
## RPS4Y2 <NA> <NA> <NA> <NA>
## protein_id protein_version ccds_id
## <character> <character> <character>
## RPL22 <NA> <NA> <NA>
## RPL11 <NA> <NA> <NA>
## RPS6KA1 <NA> <NA> <NA>
## RPA2 <NA> <NA> <NA>
## RPS8 <NA> <NA> <NA>
## ... ... ... ...
## RPL36A <NA> <NA> <NA>
## RPL39 <NA> <NA> <NA>
## RPL10 <NA> <NA> <NA>
## RPS4Y1 <NA> <NA> <NA>
## RPS4Y2 <NA> <NA> <NA>
## -------
## seqinfo: 70 sequences from an unspecified genome; no seqlengths
To ensure multi-omic consistency, the analysis was restricted to the exact same 500 randomly selected cells per patient used in the scRNA-seq workflow. To handle the raw epigenetic data for these specific cells, the scATAC-seq fragment files were pre-processed via system-level commands (gunzip, bgzip, and tabix). Then Tabix indexing was performed, a fundamental step for the analysis epigenetic component, primarily due to the massive and sparse nature of scATAC-seq data. While gene expression (RNA) files are relatively compact, ATAC fragment files can often reach several gigabytes in size, making it impossible to load the entire dataset into the computer’s RAM. Tabix creates a .tbi file that serves as an index or a genomic roadmap. Thanks to this index, the Signac software does not need to read the entire file from beginning to end to perform an analysis; instead, it can move to the exact genomic coordinates of the Ribosomal Protein (RP) genes. Without this block-compressed indexing (bgzip), querying the regions of interest would require prohibitive processing times and would inevitably lead to system crashes due to resource exhaustion.
The data were normalized using TF-IDF (Term Frequency-Inverse Document Frequency) and reduced via SVD (Singular Value Decomposition), a process known as LSI (Latent Semantic Indexing).
p1_ATAC_path <- "../data/ATAC/.tsv.gz/GSM9219736_PDX_OS107_atac_fragments.tsv"
p2_ATAC_path <- "../data/ATAC/.tsv.gz/GSM9219738_PDX_OS526_atac_fragments.tsv"
p3_ATAC_path <- "../data/ATAC/.tsv.gz/GSM9219740_PDX_OS774_atac_fragments.tsv"
p4_ATAC_path <- "../data/ATAC/.tsv.gz/GSM9219742_PDX_OS833_atac_fragments.tsv"
p5_ATAC_path <- "../data/ATAC/.tsv.gz/GSM9219744_PDX_384_atac_fragments.tsv"
# INDICIZZAZIONE
library(Rsamtools)
all_paths <- c(p1_ATAC_path, p2_ATAC_path, p3_ATAC_path, p4_ATAC_path, p5_ATAC_path)
for (path in all_paths) {
if (!file.exists(paste0(path, ".tbi"))) {
system(paste("tabix -p bed", path))
}
}
# Creazione degli oggetti ATAC
p1_OS107_ATAC_ribo <- prepare_pdx_atac(p1_ATAC_path, ribo_granges, colnames(p1), "P1")
p2_OS526_ATAC_ribo <- prepare_pdx_atac(p2_ATAC_path, ribo_granges, colnames(p2), "P2")
p3_OS774_ATAC_ribo <- prepare_pdx_atac(p3_ATAC_path, ribo_granges, colnames(p3), "P3")
p4_OS833_ATAC_ribo <- prepare_pdx_atac(p4_ATAC_path, ribo_granges, colnames(p4), "P4")
p5_384_ATAC_ribo <- prepare_pdx_atac(p5_ATAC_path, ribo_granges, colnames(p5), "P5")
atac_merged_con_clusters <- merge(p1_OS107_ATAC_ribo, y = c(p2_OS526_ATAC_ribo, p3_OS774_ATAC_ribo, p4_OS833_ATAC_ribo, p5_384_ATAC_ribo),
add.cell.ids = c("P1", "P2", "P3", "P4", "P5"))
As for scRNA-seq data, again the objective was to determine whether the patient-specific patterns observed in gene expression were also rooted in the chromatin architecture. By focusing on the accessibility of genomic regions corresponding to RP genes, the aim was to identify whether the “translational identity” of each PDX model is already encoded at the level of open chromatin before transcription occurs.
The resulting UMAP projection (Fig. 3) provides a high-resolution map of the epigenetic landscape of the 2,500 subsampled cells.
Several key observations can be made: (1) much like the scRNA-seq results, the cells do not form a single, uniform cluster. Instead, they organize into distinct groups that correlate strongly with the original patient identity. This indicates that chromatin accessibility at ribosomal loci is not a generic “housekeeping” feature but is highly divergent between different tumor models; (2) the clear separation suggests that each PDX model possesses a unique “epigenetic fingerprint.” For instance, if Patient A’s cells are localized far from Patient B’s cells on the UMAP, it implies that the regulatory elements (promoters and enhancers) governing their ribosomal proteins have significantly different levels of openness; (3) the fact that the ATAC-seq data mirrors the RNA-seq clustering confirms the robustness of the RP signature. It proves that the heterogeneity we observe is a biological reality of the PDX models and not a technical artifact of a single sequencing method.
From a biological standpoint, this visualization supports the hypothesis of “tumor-specialized ribosomes.” It suggests that different tumors maintain different “epigenetic programs” for their protein synthesis machinery, potentially to adapt to specific metabolic or oncogenic needs.
atac_obj <- carica_oggetto_atac("../data/ATAC/atac_merged_con_clusters.rds")
atac_obj <- processa_pdx_atac(atac_obj)
Seurat::ElbowPlot(atac_obj, ndims = 15, reduction = "lsi") +
ggtitle("3a. Dimensionality Validation: LSI Elbow Plot")
DimPlot(atac_obj, group.by = "orig.ident") + ggtitle("3b. Patient Specific ATAC Signature")
Dimensionality reduction was performed using Latent Semantic Indexing (LSI). We selected dimensions 2 through 15 for the downstream UMAP projection and clustering. The first LSI component was excluded as it often captures technical variation related to sequencing depth rather than biological signal. The range of 15 dimensions was chosen as a robust compromise to capture the inter-patient heterogeneity of the PDX models while minimizing stochastic noise.
To ensure the robustness of the scATAC-seq analysis, a dual-parameter sweeping, using the same concept that was used for scRNA-seq sweeping, has been performed. This step is crucial to verify that the patient-specific clustering is a biological reality and not an artifact of specific parameter choices.
In the first row of Fig. 4a-c, three different ranges of Latent Semantic Indexing (LSI) components have been tested to define the UMAP space, consistently excluding the first dimension to avoid technical bias related to sequencing depth:
Under-fitting (2:5 LSI): Using too few dimensions results in a compressed projection (Fig. 4a) where the 5 PDX models are forced together, losing the fine-grained epigenetic differences between patients.
Optimal (2:15 LSI): Based on the Elbow Plot, this range (Fig. 4b) provides the clearest separation. The cells organize into distinct clouds corresponding to their patient of origin (orig.ident), confirming that the ribosomal protein (RP) chromatin accessibility signature is highly patient-specific.
Noise Inclusion (2:50 LSI): Increasing the dimensions further (Fig. 4c) does not improve separation; instead, it introduces stochastic noise that begins to blur the cluster boundaries, leading to a less defined global structure.
In the second row of Fig. 4d-f, Resolution Sweep is performed, using the optimal dimensionality (2:15 LSI). The resolution parameter tested were 0.1, 0.5 and 1,2, in order to find the most biologically meaningful granularity:
Broad Clusters (Res 0.1): This setting is too conservative (Fig. 4d), grouping all cells into a single large cluster and failing to recognize the inherent inter-patient heterogeneity.
Optimal (Res 0.5): This resolution (Fig. 4e) perfectly mirrors the experimental design, identifying 5 distinct clusters that map directly to the 5 PDX models. This “gold standard” confirms that the RP signature is the primary driver of epigenetic variation in our dataset.
Over-clustering (Res 1.2): High resolution (Fig. 4f) artificially fragments the data into 13 clusters. These sub-groups likely represent technical noise or minor cell-state variations rather than significant biological differences in the ribosomal chromatin program.
stile_titolo <- theme(plot.title = element_text(size = 12, face = "bold", hjust = 0.5))
# SWEEP DELLE DIMENSIONI (LSI)
p_low_dim <- DimPlot(RunUMAP(atac_obj, reduction = "lsi", dims = 2:5, verbose = FALSE),
group.by = "orig.ident") +
ggtitle("4a. Dimensionality: 2:5 LSI (under-fitting)") + stile_titolo
p_mid_dim <- DimPlot(RunUMAP(atac_obj, reduction = "lsi", dims = 2:15, verbose = FALSE),
group.by = "orig.ident") +
ggtitle("4b. Dimensionality: 2:15 LSI (optimal)") + stile_titolo
p_high_dim <- DimPlot(RunUMAP(atac_obj, reduction = "lsi", dims = 2:50, verbose = FALSE),
group.by = "orig.ident") +
ggtitle("4c. Dimensionality: 2:50 LSI (noise inclusion)") + stile_titolo
# SWEEP DELLA RISOLUZIONE
res_base_atac <- FindNeighbors(atac_obj, reduction = "lsi", dims = 2:15, verbose = FALSE)
p_res_low <- DimPlot(FindClusters(res_base_atac, resolution = 0.1, verbose = FALSE),
group.by = "seurat_clusters") +
ggtitle("4d. Resolution: 0.1 (broad clusters)") + stile_titolo
p_res_mid <- DimPlot(FindClusters(res_base_atac, resolution = 0.5, verbose = FALSE),
group.by = "seurat_clusters") +
ggtitle("4e. Resolution: 0.5 (optimal)") + stile_titolo
p_res_high <- DimPlot(FindClusters(res_base_atac, resolution = 1.2, verbose = FALSE),
group.by = "seurat_clusters") +
ggtitle("4f. Resolution: 1.2 (over-clustering)") + stile_titolo
(p_low_dim | p_mid_dim | p_high_dim) / (p_res_low | p_res_mid | p_res_high)
The integration of these results justifies the final selection of dimensions 2:15 and resolution 0.5. This configuration captures the maximum biological signal with minimum noise, demonstrating that each patient maintains a unique and stable “epigenetic fingerprint” regarding ribosomal protein gene accessibility.
To conclude the validation of our scATAC-seq analysis, we generate a contingency table to objectively quantify the alignment between the computational clusters and the biological origin of the cells.
The ultimate goal of testing multiple clustering parameters is to assess whether PDX samples can be separated based on the chromatin accessibility state of ribosomal protein loci. While UMAP visualizations provide a qualitative overview, a contingency table provides the statistical proof of this separation.
This table cross-references the clusters identified by the algorithm (at the optimal resolution of 0.5) with the original patient metadata . A “clean” separation is indicated by a one-to-one (or nearly one-to-one) correspondence, where each cluster is almost exclusively composed of cells from a single patient. This confirms that the ribosomal protein (RP) chromatin signature is a robust, patient-specific feature rather than a result of stochastic noise or technical batch effects.
| OS107 | OS384 | OS526 | OS774 | OS833 | |
|---|---|---|---|---|---|
| 0 | 343 | 218 | 338 | 274 | 287 |
| 1 | 56 | 78 | 54 | 115 | 94 |
| 2 | 55 | 131 | 36 | 27 | 25 |
| 3 | 15 | 56 | 34 | 79 | 71 |
| 4 | 31 | 17 | 38 | 5 | 23 |
During the initial stages of the analysis, the RStudio environment encountered continuous crashes and session restarts, particularly when reading and also during the merging of high-dimensional scRNA-seq and scATAC-seq datasets from multiple patients.
By monitoring the system’s resource allocation in real-time using the system(“free -m”) command, i discovered a critical memory bottleneck, as the process was hitting the default hardware limits of the virtualized environment.
To fix this, i had to manually tell Windows to give more ‘breathing room’ to the Linux system (WSL2) that runs Docker. To resolve the issue i had to go into the main User folder on the computer and created a simple text file named .wslconfig. In this file, i changed the default limits by writing these exact settings:
[wsl2] memory=6GB swap=4GB
After restarting the PC, the system finally had enough memory to handle the large datasets without crashing.
This adjustment was also supported by the use of manually triggered garbage collection (gc()) within the R script, ensuring that the heavy computational tasks—such as fragment object creation and large-scale merging—could be completed successfully without further system failures.
Data Management and Storage Challenges
Furthermore, during the final stages of the project, significant challenges were encountered regarding data storage and synchronization across platforms:
GitHub Repository Constraints: While uploading the contents of the Progetto_Bioinformatica_2026 folder to GitHub, the process failed repeatedly due to file size limitations. To resolve this, all RAW data files for scRNA-seq and scATAC-seq (specifically .h5 and .tsv.gz formats) were excluded from the repository. These files are not actively required for the final “Knit” compilation—which relies on processed objects—and their exclusion was also a strategic choice to prevent further RAM allocation issues.
Optimization of Expression Matrices: The matrix.mtx file, generated from the merged 5-patient PDX samples following the random sampling of 500 cells per sample, exceeded the upload limits on the repository in GitHub, consequently, it was replaced by a more efficient .rds file. This format provides a highly compressed yet accessible representation of the data, ensuring the analysis remains fully reproducible without the overhead of heavy sparse matrix files.
Docker Hub Deployment: Similarly, when pushing the final environment to Docker Hub, the raw data, .h5, and compressed .tsv files were omitted. This optimization was necessary to maintain a manageable image size, ensuring that the container remains portable and focuses exclusively on the processed data structures and the pre-installed software dependencies required to execute the report.
In conclusion, these adjustments in data handling and system configuration were essential to overcome hardware limitations, ensuring a stable and reproducible workflow within both the GitHub and Docker ecosystems.